Emphatic TD Bellman Operator is a Contraction

Authors

  • Assaf Hallak
  • Aviv Tamar
  • Shie Mannor
Abstract

Recently, Sutton et al. (2015) introduced the emphatic temporal differences (ETD) algorithm for off-policy evaluation in Markov decision processes. In this short note, we show that the projected fixed-point equation that underlies ETD involves a contraction operator, with a √γ-contraction modulus (where γ is the discount factor). This allows us to provide error bounds on the approximation error of ETD. To our knowledge, these are the first error bounds for an off-policy evaluation algorithm under general target and behavior policies.
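
As a hedged illustration of how a contraction modulus can translate into an approximation error bound, the sketch below follows the standard projected fixed-point argument. The symbols are assumptions introduced here and do not appear in the abstract: V is the true value function, T is an operator with TV = V that is a β-contraction in some weighted Euclidean norm ‖·‖, Π is the orthogonal projection onto the approximation subspace in that same norm, and V̂ solves V̂ = ΠTV̂. The exact norm and constants used in the note itself may differ.

```latex
% Assumptions (illustrative, not taken from the abstract):
%   TV = V,  \hat V = \Pi T \hat V,
%   \|Tx - Ty\| \le \beta \|x - y\|,  \Pi is an orthogonal (hence non-expansive) projection.
\begin{align*}
\|V - \hat V\|^2
  &= \|V - \Pi V\|^2 + \|\Pi V - \hat V\|^2
     && \text{(Pythagoras: } \Pi V - \hat V \text{ lies in the subspace)} \\
  &= \|V - \Pi V\|^2 + \|\Pi T V - \Pi T \hat V\|^2
     && \text{(using } TV = V,\ \hat V = \Pi T \hat V\text{)} \\
  &\le \|V - \Pi V\|^2 + \beta^2 \|V - \hat V\|^2
     && \text{(non-expansive } \Pi,\ \beta\text{-contraction } T\text{)}
\end{align*}
% Rearranging, and then substituting \beta = \sqrt{\gamma}:
\[
\|V - \hat V\| \;\le\; \frac{1}{\sqrt{1 - \beta^2}}\,\|V - \Pi V\|
\;=\; \frac{1}{\sqrt{1 - \gamma}}\,\|V - \Pi V\| .
\]
```

Under these assumptions, the error of the projected fixed point is at most a factor 1/√(1 − γ) larger than the best error achievable within the subspace.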


Similar resources

True Online Emphatic TD(λ): Quick Reference and Implementation Guide

TD(λ) is the core temporal-difference algorithm for learning general state-value functions (Sutton 1988, Singh & Sutton 1996). True online TD(λ) is an improved version incorporating dutch traces (van Seijen & Sutton 2014, van Seijen, Mahmood, Pilarski & Sutton 2015). Emphatic TD(λ) is another variant that includes an “emphasis algorithm” that makes it sound for off-policy learning (Sutton, Mahm...


True Online Emphatic TD(λ): Quick Reference and Implementation Guide

This document is a guide to the implementation of true online emphatic TD(λ), a model-free temporal-difference algorithm for learning to make long-term predictions which combines the emphasis idea (Sutton, Mahmood & White 2015) and the true-online idea (van Seijen & Sutton 2014). The setting used here includes linear function approximation, the possibility of off-policy training, and all the ge...


O2TD: (Near)-Optimal Off-Policy TD Learning

Temporal difference learning and Residual Gradient methods are the most widely used temporal-difference-based learning algorithms; however, it has been shown that none of their objective functions are optimal with respect to approximating the true value function V. Two novel algorithms are proposed to approximate the true value function V. This paper makes the following contributions: • A batch algorit...


The Curvature of a Single Contraction Operator on a Hilbert Space

This note studies Arveson’s curvature invariant for d-contractions T = (T_1, T_2, ..., T_d) for the special case d = 1, referring to a single contraction operator T on a Hilbert space. It establishes a formula which gives an easy-to-understand meaning for the curvature of a single contraction. The formula is applied to give an example of an operator with nonintegral curvature. Under the additio...


Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis

We consider the off-policy evaluation problem in Markov decision processes with function approximation. We propose a generalization of the recently introduced emphatic temporal differences (ETD) algorithm (Sutton, Mahmood, and White, 2015), which encompasses the original ETD(λ), as well as several other off-policy evaluation algorithms as special cases. We call this framework ETD(λ, β), where o...



Journal:
  • CoRR

Volume abs/1508.03411

Publication date: 2015